
1. Introduction

Disease is an unavoidable part of existence, affecting not only people but all living things. Everyone will encounter one or more diseases over their life, whether inherited or caused by external factors. To fight the many diseases caused by external agents that enter our systems, we rely on the body's built-in immune responses and on medical therapies.

The human species has experienced countless diseases throughout history, and as our lives change, diseases evolve and take on new forms. The recent Covid-19 outbreak exemplifies the deadly impact a disease can have on a worldwide scale. Pandemics, however, are not a new phenomenon; throughout history, the globe has been shaken by many illnesses that spread widely and reduced human life expectancy.

This analysis looks at a large dataset containing information on 50 illnesses recorded across the states of the United States between 1888 and 2014. The information comes from Project Tycho, which works with researchers and with national and international health institutes to provide open, free data for public use in analysis and research.

When dealing with such a massive volume of data, visualization becomes a vital tool. Graphical representations allow for more meaningful comprehension and insight than combing through massive Excel spreadsheets.

2. Problem Definition

The data from Tycho 2.0 can be analyzed to determine whether it holds any hidden patterns. It offers a plethora of possibilities for research on the 50 illnesses that affected the United States, but the scope must be narrowed to specific areas in order to achieve a high-quality analysis.

The study’s goal is to pinpoint the most common illness that afflicted Americans from 1888 to 2014 and to ascertain which states were most impacted. The study attempts to identify the illness type with the greatest prevalence rate nationwide and investigate its effects on various states by evaluating the data.

3. Objectives

The purpose is to explore how infections spread among the 50 states in the United States. The first step is deciphering and comprehending data formats, columns, and the data itself. This information will be useful for performing statistical analysis, making visualizations, and deriving conclusions about the relationships between events and their effects. One may gain a thorough grasp of how diseases spread by looking at these connections.

4. Methods

R, a statistical programming language, offers simple and quick tools for transforming data into visually appealing components such as graphs, which make the data easier to read and comprehend. The following graph types are plotted here using descriptive statistical methods and ggplot2.

  1. Map plot: Displays the data on a map of the US using the Plotly, tmap, and mapview packages.

  2. Bar plot with points (geom_bar with geom_point): Displays the mean value for each disease as a bar, with a point marking the mean.

  3. Scatter plot: Represents values for two numeric variables. The position of each dot on the horizontal and vertical axes indicates the values for an individual data point. Scatter plots are used here to observe relationships between years and cases or deaths.

  4. Grouped bar plot: Represents categorical data with rectangular bars whose heights or lengths are proportional to the values they represent, placing two or more variables side by side for comparison.

  5. Heat map: A heat map is a two-dimensional representation of data in which values are represented by color gradients.

  6. Box plot with animation: A box plot shows the distribution of continuous data by visualizing five summary statistics. Animating the plot makes the representation easier to follow.
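The five summary statistics behind a box plot can be inspected directly in base R. The sketch below is only an illustration: `weekly_counts` is a made-up vector, not Tycho data. It uses `boxplot.stats()`, whose `stats` element holds the lower whisker, lower hinge, median, upper hinge, and upper whisker.

```r
# Illustrative counts; hypothetical values, not taken from the dataset
weekly_counts <- c(1, 3, 4, 5, 14, 3, 5, 3, 1)

# boxplot.stats() returns the five numbers that a box plot draws
five_numbers <- boxplot.stats(weekly_counts)$stats
names(five_numbers) <- c("lower_whisker", "lower_hinge", "median",
                         "upper_hinge", "upper_whisker")
print(five_numbers)
```

Values beyond the whiskers are reported separately in the `out` element and are drawn as individual points.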

4.1 Libraries

Below is a list of libraries that were utilized in this analysis.

library (ggplot2)
library (dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library (tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ✔ readr     2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library (tmap)
library (mapview)
## The legacy packages maptools, rgdal, and rgeos, underpinning this package
## will retire shortly. Please refer to R-spatial evolution reports on
## https://r-spatial.org/r/2023/05/15/evolution4.html for details.
## This package is now running under evolution status 0
library (plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library (viridis)
## Loading required package: viridisLite
library (ggridges)
library (readr)
library (usmap)
library (gapminder)
library (ggthemes)
library (gganimate)
library (png)
library (gifski)
options(scipen=999999) # disable scientific notation
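To see what the `scipen` option changes, the following sketch formats the same number under R's default penalty and under a large one. The value 999 is an arbitrary choice for illustration, and the previous setting is restored afterwards.

```r
old <- options(scipen = 0)      # R's default penalty
default_format <- format(1e8)   # scientific notation is shorter, so R picks it
options(scipen = 999)           # a large penalty favours fixed notation
fixed_format <- format(1e8)
options(old)                    # restore the previous setting
print(c(default_format, fixed_format))
```

A positive penalty makes fixed notation win whenever its width exceeds the scientific form by no more than `scipen` characters, which keeps axis labels and printed totals readable.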

4.2 Data

The code below sets the working directory and then loads the data to be evaluated into a data frame.

setwd("K:/AI & DS/Data Visualisation/Tycho2") # Set the working directory for the project.

US <- read.csv("ProjectTycho_Level2_v1.1.0.csv", header = TRUE, stringsAsFactors = TRUE) # Load the data into the workspace.

4.3 Data Cleaning & Preparing

The next stage is to examine the size and contents of the data. Once the data has been sorted and cleaned, it is ready for analysis.

dim(US) # It shows dimension of our data frame.
## [1] 3659360      11
US <- US %>% arrange(epi_week) # Sort the rows by epidemiological week.
tail(US) # Show the last rows of the data.
US <- US[ , -2] # Remove the unwanted second column.

US <- US[ , -10] # Remove the last of the 10 remaining columns (column 11 of the original data).
str(US) # Inspect the data type of each column.
## 'data.frame':    3659360 obs. of  9 variables:
##  $ epi_week : int  188737 188823 188823 188823 188824 188824 188824 188824 188824 188824 ...
##  $ state    : Factor w/ 57 levels "AK","AL","AR",..: 20 39 39 39 42 42 42 39 39 42 ...
##  $ loc      : Factor w/ 630 levels "ABERDEEN","ADAMS",..: 325 123 123 123 465 465 465 123 123 465 ...
##  $ loc_type : Factor w/ 2 levels "CITY","STATE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ disease  : Factor w/ 50 levels "ANTHRAX","BABESIOSIS",..: 46 46 36 11 46 36 11 46 11 38 ...
##  $ event    : Factor w/ 2 levels "CASES","DEATHS": 2 2 2 2 2 2 2 2 2 2 ...
##  $ number   : int  3 1 1 5 14 4 4 3 5 3 ...
##  $ from_date: Factor w/ 10557 levels "1887-09-09","1888-06-03",..: 1 2 2 2 3 3 3 3 3 3 ...
##  $ to_date  : Factor w/ 10557 levels "1887-09-15","1888-06-09",..: 1 2 2 2 3 3 3 3 3 3 ...
US <- US |> filter(!is.na(number)) # Drop rows with a missing count.
US <- subset(US, !(state %in% c("AS", "GU", "MP", "PR", "PT", "VI"))) # Drop the territory codes from the state column.
US$to_date <- as.Date(US$to_date, format = "%Y-%m-%d") # Convert the date columns from factor to Date.

US$from_date <- as.Date(US$from_date, format = "%Y-%m-%d")
str(US) # Check the data type of each column.
## 'data.frame':    3650344 obs. of  9 variables:
##  $ epi_week : int  188737 188823 188823 188823 188824 188824 188824 188824 188824 188824 ...
##  $ state    : Factor w/ 57 levels "AK","AL","AR",..: 20 39 39 39 42 42 42 39 39 42 ...
##  $ loc      : Factor w/ 630 levels "ABERDEEN","ADAMS",..: 325 123 123 123 465 465 465 123 123 465 ...
##  $ loc_type : Factor w/ 2 levels "CITY","STATE": 1 1 1 1 1 1 1 1 1 1 ...
##  $ disease  : Factor w/ 50 levels "ANTHRAX","BABESIOSIS",..: 46 46 36 11 46 36 11 46 11 38 ...
##  $ event    : Factor w/ 2 levels "CASES","DEATHS": 2 2 2 2 2 2 2 2 2 2 ...
##  $ number   : int  3 1 1 5 14 4 4 3 5 3 ...
##  $ from_date: Date, format: "1887-09-09" "1888-06-03" ...
##  $ to_date  : Date, format: "1887-09-15" "1888-06-09" ...
sort(summary(US$state)) # Total entries per state.
##     AS     GU     MP     PR     PT     VI     AK     HI     WY     MS     NV 
##      0      0      0      0      0      0   5272   8436  11875  15383  18084 
##     SD     AZ     ID     NM     DE     ND     ME     UT     VT     DC     OR 
##  26406  27839  28626  29091  30749  36092  36830  37563  38571  40058  41847 
##     OK     NE     AR     NH     RI     IA     SC     WV     KY     CO     FL 
##  42851  44666  45668  47375  52831  58812  59588  59984  61325  62440  62674 
##     KS     LA     NC     MT     AL     WA     GA     TN     MD     MN     CT 
##  64734  66134  71351  72488  74015  74067  75527  78327  80560  81114  87596 
##     MO     VA     MI     IL     WI     IN     TX     NJ     CA     OH     PA 
##  88793  95074 110275 111040 111533 117738 120102 137949 139075 168281 175361 
##     NY     MA 
## 186258 232016
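The zero counts for AS, GU, MP, PR, PT, and VI appear because filtering rows does not remove the corresponding factor levels. A minimal sketch, using a toy factor in place of `US$state` (the values are illustrative), shows how `droplevels()` would discard them:

```r
# Toy factor standing in for US$state; values are illustrative only
state <- factor(c("NY", "MA", "PR", "NY", "GU"))

kept <- state[!(state %in% c("PR", "GU"))]
print(summary(kept))       # PR and GU are still listed, each with count 0

kept <- droplevels(kept)   # discard levels that no longer occur
print(summary(kept))
```

Applying the same call to the full data frame, for example `US <- droplevels(US)`, would be one way to keep empty codes out of later summaries and plot legends.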
sort(summary(US$disease)) # Total entries for each disease.
##                            ENCEPHALITIS                                BOTULISM 
##                                       5                                      10 
##                               VARIOLOID                              BABESIOSIS 
##                                      21                                      36 
##                            YELLOW FEVER                                 CHOLERA 
##                                      75                                     125 
##               EHRLICHIOSIS/ANAPLASMOSIS                                  DENGUE 
##                                     136                                     621 
##                            TRICHINIASIS                             PSITTACOSIS 
##                                     737                                     824 
##                      COCCIDIOIDOMYCOSIS                    TOXIC SHOCK SYNDROME 
##                                    1066                                    1256 
##                                 TETANUS STREPTOCOCCAL DISEASE, INVASIVE GROUP A 
##                                    1732                                    3306 
##               STREPTOCOCCAL SORE THROAT                            LYME DISEASE 
##                                    4587                                    4706 
##                           LEGIONELLOSIS                                 ANTHRAX 
##                                    6686                                    7051 
##                       CRYPTOSPORIDIOSIS                                 LEPROSY 
##                                    7589                                    7677 
##                             SHIGELLOSIS                              GIARDIASIS 
##                                    8651                                   10142 
##                               DYSENTERY                            TYPHUS FEVER 
##                                   10493                                   11865 
##                                 MALARIA                           SALMONELLOSIS 
##                                   12216                                   12737 
##            BRUCELLOSIS [UNDULANT FEVER]                               TULAREMIA 
##                                   14957                                   15984 
##                               CHLAMYDIA            ROCKY MOUNTAIN SPOTTED FEVER 
##                                   16524                                   18655 
##                               GONORRHEA                                 RUBELLA 
##                                   23268                                   25843 
##                              MENINGITIS                       RABIES IN ANIMALS 
##                                   29759                                   30304 
##                                PELLAGRA                             HEPATITIS B 
##                                   33244                                   49667 
##                             HEPATITIS A                  CHICKENPOX [VARICELLA] 
##                                   50526                                   74434 
##                                   MUMPS                           POLIOMYELITIS 
##                                   87771                                  140606 
##                               PNEUMONIA              WHOOPING COUGH [PERTUSSIS] 
##                                  213518                                  229798 
##                               INFLUENZA                 PNEUMONIA AND INFLUENZA 
##                                  236673                                  239793 
##                                SMALLPOX           TYPHOID FEVER [ENTERIC FEVER] 
##                                  274862                                  304320 
##                           SCARLET FEVER      TUBERCULOSIS [PHTHISIS PULMONALIS] 
##                                  344805                                  346515 
##                                 MEASLES                              DIPHTHERIA 
##                                  354973                                  379195
Events <- US %>% filter(number >= 1) # Keep only rows that report at least one case or death.
Events$epi_week <- substr(Events$epi_week, 1, 4) # Keep the first four digits (the year) of epi_week.
colnames(Events)[1] <- 'year' # Rename epi_week to year now that the week part is gone.
Events$year <- as.numeric(Events$year)
Event_Case <- subset(Events, event == "CASES", select = c(year, state, disease, number)) # Rows reporting cases.
Event_Death <- subset(Events, event == "DEATHS", select = c(year, state, disease, number)) # Rows reporting deaths.
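The year extraction above relies on `epi_week` being a six-digit number of the form YYYYWW, which is an assumption based on the values shown in the `str()` output. Under that assumption, integer arithmetic gives the same year without a round trip through character strings, as this sketch shows:

```r
# The first two values appear in the str() output; the third is illustrative
epi_week <- c(188737L, 188823L, 201452L)

year_from_substr   <- as.numeric(substr(epi_week, 1, 4))  # string-based approach
year_from_division <- epi_week %/% 100                    # integer division drops the week
week_from_modulo   <- epi_week %% 100                     # remainder is the week number

print(data.frame(epi_week, year_from_division, week_from_modulo))
```

Keeping the week number as well would allow seasonal comparisons later, although this analysis only uses the year.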

US map 1 Cases

This first plot, an interactive choropleth map, displays the total cases for each state of the United States, year by year, from 1888 to 2014. It shows that particular states or groups of states, such as New York in the Northeast, appear in lighter hues more often than others. Texas in the South and California on the West Coast also stand out. States in the geographic center of the country, on the other hand, have dark hues, signifying fewer cases between 1888 and 2014.

# Prepare the data for the map
year_state_Case <- Event_Case %>% select(year, state, number)
year_state_Case <- aggregate(number ~ year + state, data = year_state_Case, FUN = sum) # Total cases per year and state

# Plot the aggregated data
plot_geo(year_state_Case, locationmode = 'USA-states', frame = ~ year) %>% 
  add_trace(locations = ~ state, z = ~number, zmin = 0, zmax = max(year_state_Case$number), color = ~number, colorscale = 'PuBu') %>%
  layout(geo = list(scope = 'usa'), title = "Choropleth map of cases for all diseases by year")
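The aggregation step that feeds the map can be checked on a small example. The sketch below uses a hypothetical `toy` data frame with the same three columns and sums `number` for every combination of `year` and `state`:

```r
# Hypothetical records; not taken from the Tycho data
toy <- data.frame(
  year   = c(1900, 1900, 1900, 1901),
  state  = c("NY", "NY", "MA", "NY"),
  number = c(3, 5, 2, 7)
)

# One row per (year, state) pair, with the counts added up
totals <- aggregate(number ~ year + state, data = toy, FUN = sum)
print(totals)
```

The two 1900 records for NY collapse into a single row, which is the shape the animated map needs: one value per state per frame.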

US map 2 Death

The map above shows only cases. To ascertain whether the burden is distributed evenly across states, deaths from all diseases combined should also be compared for each state. Comparing cases and deaths per state generally supports the idea that a spike in cases leads to a rise in fatalities; however, certain states (WI, WA) do not follow this pattern, which suggests that the prevalence of high-fatality illnesses is uneven. The plot below shows the opposite event type, deaths, for each US state.

# Prepare the data for the map
year_state_Death <- Event_Death %>% select(year, state, number)
year_state_Death <- aggregate(number ~ year + state, data = year_state_Death, FUN = sum) # Total deaths per year and state

# Plot the aggregated data
plot_geo(year_state_Death, locationmode = 'USA-states', frame = ~ year) %>% 
  add_trace(locations = ~ state, z = ~number, zmin = 0, zmax = max(year_state_Death$number), color = ~number, colorscale = 'Electric') %>%
  layout(geo = list(scope = 'usa'), title = "Choropleth map of deaths for all diseases by year")

Bar Plot + Geom Point on Mean value of Cases

Two layers are combined in this plot: bars and points. It displays the mean number of cases per record for each disease. The results show that chlamydia has the greatest mean value and yellow fever the lowest.

# Mean number of cases per record for each disease
Mean_Dise <- Event_Case %>% group_by(disease) %>% summarise(number = mean(number))
ggplot(Mean_Dise, aes(number, disease)) + # mean on the x axis, disease on the y axis
  geom_bar(stat = "identity", fill = 'gray35', alpha = 0.8, width = 0.6) + geom_point(color = "gray0", size = 3) + # bars with a point at each mean
  theme_minimal() + # base theme first, so the axis tweak below is not overwritten
  theme(axis.text.x = element_text(angle = 30, hjust = 1)) +
  labs(title ="Mean bar graph on Cases", subtitle = 'log10 scale on x axis' , x = 'Number of Cases') +
  scale_x_log10(breaks = c(10,20,40,80,160,320)) # log10 breaks for the x axis

Scatter Plot-1 for Cases

The scatter plot compares cases over the observation period, using a log10 scale for the yearly case totals of all diseases in every state. Between 1930 and 1960, cases were elevated in every state. Data is missing for the four years from 2002 to 2005, after which cases rise again around 2010.

ggplot(data = year_state_Case) +
  geom_jitter(aes(x =year, y = number, color = state), alpha=.4, size = 1) + 
  labs(title ='Comparing Cases according states by year', subtitle = 'log10 on Y axis' , y = 'Number of Cases') +
  scale_y_log10() +
  theme_minimal() +
  theme(legend.position = c(0))

Scatter Plot-2 for Deaths

This scatter plot compares deaths over the observation period, using a log10 scale for the yearly death totals in every state. The record contains no death information from 1948 until around 1965, so the data provided is not entirely complete.

ggplot(data = year_state_Death) +
  geom_jitter(aes(x =year, y = number, color = state), alpha=.4, size = 1) + 
  labs(title ='Comparing Death according states by year', 
       subtitle = 'log10 on Y axis' , y = 'Number of Deaths') +
  scale_y_log10() + 
  theme_minimal() +
  theme(legend.position = c(0))
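Gaps such as the ones visible in the scatter plots can also be listed programmatically rather than read off the chart. The sketch below is a minimal illustration on a hypothetical vector `years_present`; with the real data, that vector could be `unique(year_state_Death$year)`.

```r
# Hypothetical set of years that have at least one record
years_present <- c(1998:2001, 2006:2010)

# Every year between the first and last observation
all_years <- seq(min(years_present), max(years_present))

# Years in the range with no record at all
missing_years <- setdiff(all_years, years_present)
print(missing_years)
```

This only finds years with no rows at all; years with partial reporting would need a count per year instead.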

Grouped Bar plot 1

This bar graph shows the diseases with fewer than 50,000 documented cases, along with the cases and deaths recorded for each. 25 diseases fall into this group. Because several diseases have fewer recorded cases than recorded deaths, the data is not entirely reliable. Most strikingly, roughly 40,000 deaths are documented for pneumonia and influenza combined, while no cases are reported for that category.

Dise <- Events %>% 
        count(disease,event, wt = number) %>% 
        pivot_wider(names_from = event, values_from = n, values_fill = 0)

Dise50_Plus <- Dise[Dise$CASES >= 50000, c("disease","CASES","DEATHS")] %>% 
               pivot_longer(cols = DEATHS : CASES, names_to = "event", values_to = "number")
Dise50 <- Dise[Dise$CASES < 50000, c("disease","CASES","DEATHS")] %>% 
          pivot_longer(cols = DEATHS : CASES, names_to = "event", values_to = "number")
ggplot(Dise50, aes(x = reorder(disease, number) , y = reorder(number, disease),  fill= event))+ # Assigning axis 
  geom_bar(position='dodge', stat='identity') + # Dodge plot 
  scale_fill_manual(values = c("#CCCC00", "#666600")) + # color 
  labs(title ="Up to 50,000 cases and deaths reported per Disease", x = 'Disease' , y = 'Numbers') +  # Title of the plot
  theme(text = element_text(size = 8),axis.text.x = element_text(angle = 30, hjust = 1),legend.position = c(0.1, 0.8)) # Theme with adjustment 
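The reshaping step, weighted counts per disease and event spread into one column per event, can be reproduced in base R with `xtabs()`. The sketch below uses a hypothetical `toy` data frame and is not the pipeline used above; it illustrates how a disease with no case records ends up with a zero in the `CASES` column.

```r
# Hypothetical long-format records; disease names are placeholders
toy <- data.frame(
  disease = c("A", "A", "B", "B", "C"),
  event   = c("CASES", "DEATHS", "CASES", "CASES", "DEATHS"),
  number  = c(10, 2, 4, 6, 5)
)

# Sum `number` for each disease/event pair and spread events into columns
wide <- as.data.frame.matrix(xtabs(number ~ disease + event, data = toy))
print(wide)
```

Disease C has deaths but no case records, so its `CASES` entry is filled with 0, the same effect `values_fill = 0` has in `pivot_wider()`.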

Grouped Bar plot 2

This grouped bar graph shows the diseases with 50,000 or more reported cases, comparing the cases and deaths associated with each. 24 illnesses fall into this group. The records show 277655 reported cases of measles, the highest of any disease. According to the data, there are also around 70,000 cases of pneumonia and 73,000 deaths from it, so more deaths than cases have been reported for pneumonia. This information may therefore not be fully accurate for some diseases.

ggplot(Dise50_Plus, aes(x = reorder(disease, number) , y = reorder(number, disease),  fill= event))+ # Assigning axis 
  geom_bar(position='dodge', stat='identity') + # Dodge plot 
  scale_fill_manual(values = c("#3366CC", "#333399")) + # color
  labs(title ="More than 50,000 cases and deaths reported per Disease", x = 'Disease' , y = 'Numbers') + # Title of the plot
  theme(text = element_text(size = 8),axis.text.x = element_text(angle = 30, hjust = 1),legend.position = c(0.1, 0.8))  # Theme with adjustment 
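Diseases whose recorded deaths exceed their recorded cases can be flagged with a simple comparison on a wide table that has `CASES` and `DEATHS` columns. The sketch below uses made-up numbers in a hypothetical `wide` data frame; the `Dise` table built earlier has the same column layout.

```r
# Hypothetical totals; values are illustrative, not results from the data
wide <- data.frame(
  disease = c("A", "B", "C"),
  CASES   = c(100, 40, 0),
  DEATHS  = c(5, 60, 12)
)

# Rows where more deaths than cases were recorded
suspicious <- wide[wide$DEATHS > wide$CASES, ]
print(as.character(suspicious$disease))
```

A row flagged this way does not prove an error, since cases and deaths may have been reported for different periods, but it marks where the records deserve a closer look.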

Heat map

The heat map shows that almost all states were heavily affected by disease. Compared with scatter plot 1, it gives a more detailed view of how each state was affected in a particular year. Darker hues indicate a greater number of cases.

year_state_Case$Create <- cut(year_state_Case$number,  breaks = c(0,100,1000,10000,100000,1000000,5000000)) # Bin the yearly totals
# Heatmap
ggplot(year_state_Case, aes(x = year, y = state)) + # year on the x axis, state on the y axis
      geom_tile(aes(fill = Create)) + # one tile per state and year, colored by bin
      labs(title ='Heatmap of Cases on states per year', y = 'States') +
      scale_fill_manual(name = "Numbers", values = c("#FF9966","#FF3333","#CC0000","#990000","#660000", "#330000"), 
                        labels = c("<= 100", "<= 1,000", "<= 10,000", "<= 100,000", "<= 1,000,000") ) + # bin colors and labels
      theme(text = element_text(size = 8)) # theme of the plot
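The binning performed by `cut()` is worth a quick check, because its intervals are closed on the right by default. The sketch below uses a shortened set of breaks and made-up totals; `dig.lab = 7` is added here only to keep the interval labels out of scientific notation.

```r
cases <- c(50, 100, 101, 2500)   # illustrative yearly totals
bins <- cut(cases, breaks = c(0, 100, 1000, 10000), dig.lab = 7)
print(data.frame(cases, bins))

# A value equal to the lowest break falls outside every interval
zero_bin <- cut(0, breaks = c(0, 100))
print(zero_bin)
```

A total of exactly 100 lands in the first bin, while 101 moves to the second. Values at or below the lowest break, or above the highest, become `NA` and are drawn in the heat map's missing-value color.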

Events$number <- as.integer(Events$number) # Make sure the counts are stored as integers
Both_Event <- Events %>% 
          select(state, disease, event, number) %>% 
          pivot_wider(names_from = event, values_from = number, values_fn = sum) %>% # One row per state and disease, with CASES and DEATHS columns
          filter(CASES > 0 & DEATHS > 0) %>% # Keep only pairs that report both cases and deaths
          pivot_longer(cols = DEATHS : CASES, names_to = "event", values_to = "number") # Back to long format for plotting

Box Plot

The box plot below compares cases and deaths for 14 diseases. To identify fatal diseases, cases are compared with deaths. The box plot shows that CHICKENPOX [VARICELLA] and BRUCELLOSIS [UNDULANT FEVER] record far more cases than deaths, so these diseases are not especially deadly, whereas PNEUMONIA and TUBERCULOSIS [PHTHISIS PULMONALIS] record more deaths than cases, as can be seen from the quartile values of the boxes.

Box <- ggplot(Both_Event, aes(x = disease, y = number, fill = event)) +
  geom_boxplot( width = 0.6)  + scale_y_log10() +
  labs(x = "Disease", y = "Number") + scale_fill_manual(values= c("#666600", "#990000")) +
  ggtitle("Comparison of Deaths and Cases by Disease") +  
  theme(text = element_text(size = 10),axis.text.x = element_text(angle = 15, hjust = 1),legend.position = c(0.95, 0.9)) 
  
Box.animation = Box + # box plot with Animation
   transition_states(disease, wrap = FALSE) + 
   shadow_mark(alpha = 0.7) + 
   enter_grow() +
   exit_fade() + 
   ease_aes('back-out') 

Box.animation # To see animation plot   

5. Results

After this analysis, several points stand out. First of all, the data is not accurate or reliable enough to guarantee the analysis: the state column contains more than 50 codes, and some diseases report only cases while others report only deaths, which raises the question of how deaths could be recorded without any corresponding cases.

New York and Texas are more heavily affected by cases than other states. In terms of mortality, however, the impact on Texas is smaller than on New York.

In terms of disease, measles has the highest number of cases, approximately 25309922. No deaths are recorded for measles, so no hypothesis can be formed about its impact on mortality.

In addition, several years are not recorded in Project Tycho, which is a major factor inhibiting further analysis.

6. Discussion

The information provided by the US health department had notable defects; it comprised state names, dates, and epidemiological weeks. To extract useful insights, the counts were further separated by event type, distinguishing cases from deaths, and by location, covering both city and state names. The exact weekly time spans of the data allowed for analysis based on a series of occurrences.

The data shows no missing numbers; a closer look, however, reveals that it lacks information on deaths for a number of illnesses. To mention just a few: chlamydia, gonorrhea, measles, mumps, and rubella. Conversely, data on some illnesses includes only deaths, for example varioloid, cholera, and the combined pneumonia and influenza category.

When just cases were included, the research results showed that measles was the most common illness; however, if data on fatalities had been added, the technique may have yielded more accurate forecasts of mortality owing to measles.

7. Conclusion

This graphical study has shown that numerous illnesses, including meningitis, tuberculosis, pneumonia, and pellagra, have a high mortality rate in the United States, but the data shows that measles is the most prevalent disease among the 50.
